Assignment 2

options(repos = list(CRAN="http://cran.rstudio.com/"))

# Load packages here
library("ggthemes")
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library("ggplot2")
library("plyr")
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following object is masked from 'package:purrr':
## 
##     compact
library("dplyr")
library("plotly")
## 
## Attaching package: 'plotly'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, mutate, rename, summarise
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library("gganimate")
library("scales")
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

1. Chaos (19 points)

In this exercise, the goal is to create one of the most famous plots in chaos theory. The equation of the logistic map is very simple, but its behaviour stunningly complex:

\[ x_{n+1} = rx_{n}(1-x_{n}) \]

Starting with an initial value of \(x_{0}\) between one and zero, e.g. 0.5, and setting a constant value of r e.g. between zero and four, the equation is iterated forward and thereby computes \(x_{1}, x_{2}\), etc. We will only care about visualisation here, but if you are interested in learning more about the background of the equation and plot, e.g. have a look at this or this video.

The goal is to create a plot with different values of r on the x-axis and then x values on the y-axis corresponding to each r value. In parts of the plot, all these x values will be on a single point, but for other r values x moves perpetually.

The following code chunk computes the main dataset of the plot for you. You are welcome to study the code, but this is not part of the assignment and you do not have to worry about how exactly it works (this is not a course about chaos theory after all). Data contained in logistic_map_data is already in a tidy format, one variable denotes the value of r, one variable the value of the associated x’s. For each value of r repeated over \(n=1000\) rows, there are \(n\) associated rows of x values (these can be constant or fluctuating, depending on the value of r). Only some information for the colour still has to be added.

# x observations for each r value
n <- 1000
# Step between each r value
r_step <- 0.001

r_range <- seq(2.5, 4, by = r_step)
to_discard <- 500 # numbers of observations discarded before the n which are stored
logistic_map_data <- matrix(0, nrow = n*length(r_range), 2)
for (r in r_range) {
  
  current_logistic_map_series <- numeric(n+to_discard)
  current_logistic_map_series[1] <- 0.5
  
  for (k in 1:(n+to_discard-1)) {
    
    current_logistic_map_series[k+1] <- r*current_logistic_map_series[k]*(1-current_logistic_map_series[k])
    
  }
  
  start_index <- 1+n*(match(r, r_range) - 1)
  end_index <- n*match(r, r_range)
  
  logistic_map_data[start_index:end_index,1] <- r
  logistic_map_data[start_index:end_index,2] <- tail(current_logistic_map_series,n)

}

logistic_map_data <- as_tibble(data.frame(logistic_map_data))
colnames(logistic_map_data) <- c("r", "x")

Hint: Create your final dataset with n <- 1000 and r_step <- 0.001, however, for these values it takes R some time to compute the plot. When building your plot, adjusting axes, colours, etc., one approach is to first use e.g. n <- 10 and r_step <- 0.01 until you have a version of the plot that you are happy with. Just note that the opacity parameter will have to be decreased again once you have increased n because now there are more points in the plot.

# Your code here

head(logistic_map_data)
## # A tibble: 6 × 2
##       r     x
##   <dbl> <dbl>
## 1   2.5   0.6
## 2   2.5   0.6
## 3   2.5   0.6
## 4   2.5   0.6
## 5   2.5   0.6
## 6   2.5   0.6
#Creating a point plot with r and x
pBW <- ggplot(logistic_map_data, aes(x = r, y = x, colour=r )) + 
        geom_point(size = 0.01, alpha=0.01 )
                    
#Removing y-axis label, graph background and legend
pStripped <- pBW + theme(panel.background = element_blank(), axis.text.y=element_blank(),
      axis.ticks.y=element_blank(), axis.title.y = element_blank(), 
      legend.position = "None")

#Adding colour to the graph
pRainbow <- pStripped + binned_scale(scale_name="stepsn",palette = function(x) c("#FF6161", "#B7B730","#25BF25","#91DEDE","#CAD4FF","#FFB3F3"),breaks=c(3.5,3.6,3.7,3.8,3.9),aesthetics = "color")

pRainbow

2. Popularity metrics by party and gender (19 points)

In this exercise, try to replicate the following figure that displays the average popularity metrics of legislators grouped by gender and party. Note that this example first involves some reshaping of the data which you can do with dplyr from the tidyverse.

# Data for the plot
fb <- read.csv("data/fb-congress-data.csv", stringsAsFactors=FALSE)
# Your code here
table(fb$party)
## 
##    Democrat Independent  Republican 
##        4934          47        5019
#Choosing only relevant columns and party
fb <- fb[ , c(5, 6, 7, 8, 9, 10, 11, 12, 13, 15)]
fb <- fb[fb$party == 'Democrat' | fb$party == 'Republican',]

table(fb$party)
## 
##   Democrat Republican 
##       4934       5019
colnames(fb)
##  [1] "likes_count"    "comments_count" "shares_count"   "love_count"    
##  [5] "haha_count"     "wow_count"      "angry_count"    "sad_count"     
##  [9] "gender"         "party"
#Joining gender and party
fb <- fb %>%
  unite("GP", gender:party)

#Gathering data
new_fb <- gather(fb, key="measure", value="value", c("likes_count", "comments_count", "shares_count", "love_count", "haha_count", "wow_count", "angry_count", "sad_count"))

#Changing strings names data
new_fb$GP = gsub("M_Democrat","D-M",new_fb$GP)
new_fb$GP = gsub("F_Democrat","D-F",new_fb$GP)
new_fb$GP = gsub("M_Republican","R-M",new_fb$GP)
new_fb$GP = gsub("F_Republican","R-F",new_fb$GP)

new_fb = new_fb %>% 
  mutate(value=replace_na(value,0))


#Plotting bar chart
ggplot(new_fb,aes(x=GP, y =value, fill= GP))+
  geom_bar(stat = "summary",fun = 'mean')+
  theme(legend.position = "none")+
  facet_wrap(~measure,scales = "free",nrow=2)+guides(fill="none")+
    labs(title = "Partisan asymmetries by gender in Facebook popularity metrics",
       subtitle = "Female Democrats receive more engagement than Male Democrats. The opposite is true for Republicans",
       x = "Party and Gender of Member of Congress",
       y = "Average of each type of social metric")+
  scale_y_continuous(labels = comma) +
  scale_color_manual(values = c("dark blue", "blue", "dark red", "red"), aesthetics = "fill")+
  theme_minimal()

3. Ideology of presidential candidates in the US (22 points)

For this exercise, try to replicate the plot below, which Pablo Barbera prepared for a Washington Post blog post a few years ago.

The plot combines two sources of data: The ideology estimates for each actor (available in ideology_1.csv) and a random sample of ideology estimates for the three density plots (in ideology_2.csv).

As a clue, Pablo used theme_tufte from the ggthemes package as main theme (which he then edited manually). But there may be other ways of replicating it.

# Data for main plot
ideology <- read.csv("data/ideology_1.csv")
view(ideology)

# Data for background plots
bg <- read.csv("data/ideology_2.csv")
head(bg)
##    ideology       type
## 1 0.9740434 Republican
## 2 0.6166325 Republican
## 3 0.3337557 Republican
## 4 0.9350899 Republican
## 5 0.6329505 Republican
## 6 0.2592620 Republican
# Your code here

#Plotting main plot

#Generate row number or row index to table
ideology <- dplyr::mutate(ideology, id = (row_number()/20))
head(ideology)
##      screen_name   twscore              type      party twscore.sd   id
## 1       @tedcruz 0.8946006 Primary Candidate Republican 0.10787669 0.05
## 2 @RealBenCarson 0.8941568 Primary Candidate Republican 0.11932281 0.10
## 3   @ScottWalker 0.8771619 Primary Candidate Republican 0.08211766 0.15
## 4      @RandPaul 0.8285642 Primary Candidate Republican 0.09440381 0.20
## 5  @rushlimbaugh 0.7938614     Media Outlets          Z 0.10337143 0.25
## 6   @BobbyJindal 0.7871339 Primary Candidate Republican 0.07293965 0.30
#Baseline
i <- ggplot(ideology, mapping = aes(x=twscore, y=id, xmin=twscore-twscore.sd, xmax=twscore+ twscore.sd, colour = party, label=screen_name)) + 
    geom_pointrange(size=0.1) + 
     geom_text(size=2, hjust=1,nudge_x=-0.15)

#Adding colour
cols <- c('Republican' = 'red', 'Z' = 'black', 'Democrat' = 'blue')
iCol <- i + scale_color_manual(name = 'Party',values= cols)

#Adding theme and Removing background axis
iStripped <- iCol + theme_tufte() + theme(axis.text.y=element_blank(),
      axis.ticks.y=element_blank(), axis.title.y = element_blank(), 
      legend.position = "None") + labs(x = "Position on latent ideological scale", subtitle="Twitter Ideology scores of potential Democratic and Republican presidential primary candidates")

iStripped

#Plotting background plots

#Baseline
l <- ggplot(bg, aes(x=ideology, fill=type)) +
       geom_density(alpha=0.2, color=NA) +
       scale_x_continuous(limits = c(-2.5, 2.5))

#Adding colour and theme
lCol <- l + scale_color_manual(name = 'Party',values= cols, aesthetics = "fill") + theme_tufte()

#Calculating average
avg <- ddply(bg, "type", summarise, grp.mean=mean(ideology))
avg <- avg[-3,]

#Adding intercepts
lLines <- lCol + geom_vline(xintercept= 0, size=0.1, color="black") + geom_vline(data=avg, aes(xintercept=grp.mean, color=type), size=0.1) + scale_color_manual(values=c("blue", "red", "black"))

#Adding labels to intercepts
lLabel <- lLines + annotate("text", x=0.1, y=1, label= "Average Twitter Score", size=2.5, angle = 90) +
    annotate("text", x=-1.1, y=0.5, label= "Average Democrate\n in 114th congress", size=2.5, angle = 90) + 
    annotate("text", x=0.7, y=1.2, label= "Average Republican\n in 114th congress", size=2.5, angle = 90)

#Adding theme and Removing background axis
lStripped <- lLabel + theme_tufte() + theme(axis.text.y=element_blank(),
      axis.ticks.y=element_blank(), axis.title.y = element_blank(), 
      legend.position = "None") + labs(x = "Position on latent ideological scale", subtitle="Twitter Ideology scores of potential Democratic and Republican presidential primary candidates")
lStripped
## Warning: Removed 392 rows containing non-finite values (stat_density).

I attempted to plot both graphs together in one, unfortunately I was not able to fix the sizing and proportion. Code for plotting both on same scale below:

#Merging plots together
lfinal <- lStripped + geom_pointrange(data=ideology, mapping=aes(x=twscore,y=id, xmin=twscore-twscore.sd, xmax=twscore+twscore.sd, colour = party))

lfinal2 <- lfinal + 
          geom_text(ideology, mapping = aes(x=twscore, y=id, colour = party, label=screen_name,  hjust=1,nudge_x=-0.15))
## Warning: Ignoring unknown aesthetics: nudge_x
lfinal2
## Warning: Removed 392 rows containing non-finite values (stat_density).

4. Own visualisation (40 points)

In this exercises you can visualise data about a topic you are interested in.

First download the data. If you are looking for ideas, e.g. have a look at health data from the World Health Organization, economic data from the St. Louis Federal Reserve, or data on wealth and income inequality from the World Inequality Database. Once you have downloaded the data, load it into R and process it, and then explore and illustrate it with plots created with ggplot2 and/or plotly. You can also add brief explanations through markdown text.

More extensive, carefully thought out, polished, and well understandable answers will receive more points.

Note: Some of these data can also be obtained via APIs, but you can just manually download files such as .csv. for this assignment. This has no effect on the grade.

For this exercise, I am using an open source dataset which contains ratings of movies from year 2007 to 2011.

# Your code here
dataset <- read.csv("data/Movie_Ratings_data.csv")
head(dataset)
##                    Film     Genre Rotten.Tomatoes.Ratings.. Audience.Ratings..
## 1 (500) Days of Summer     Comedy                        87                 81
## 2           10,000 B.C. Adventure                         9                 44
## 3            12 Rounds     Action                        30                 52
## 4             127 Hours Adventure                        93                 84
## 5             17 Again     Comedy                        55                 70
## 6                  2012    Action                        39                 63
##   Budget..million... Year.of.release
## 1                  8            2009
## 2                105            2008
## 3                 20            2009
## 4                 18            2010
## 5                 20            2009
## 6                200            2009
colnames(dataset) <- c("Film", "Genre", "CriticalRating", "AudienceRating", "BudgetMillions", "Year")

#Analysing tail of dataset
tail(dataset)
##                           Film    Genre CriticalRating AudienceRating
## 557              Your Highness   Comedy             26             36
## 558            Youth in Revolt   Comedy             68             52
## 559 Zack and Miri Make a Porno  Romance             64             70
## 560                     Zodiac Thriller             89             73
## 561                Zombieland    Action             90             87
## 562                  Zookeeper   Comedy             14             42
##     BudgetMillions Year
## 557             50 2011
## 558             18 2009
## 559             24 2008
## 560             65 2007
## 561             24 2009
## 562             80 2011
#Analysing structure of dataset
str(dataset)
## 'data.frame':    562 obs. of  6 variables:
##  $ Film          : chr  "(500) Days of Summer " "10,000 B.C." "12 Rounds " "127 Hours" ...
##  $ Genre         : chr  "Comedy" "Adventure" "Action" "Adventure" ...
##  $ CriticalRating: int  87 9 30 93 55 39 40 50 43 93 ...
##  $ AudienceRating: int  81 44 52 84 70 63 71 57 48 93 ...
##  $ BudgetMillions: int  8 105 20 18 20 200 30 32 28 8 ...
##  $ Year          : int  2009 2008 2009 2010 2009 2009 2008 2007 2011 2011 ...
#Getting stats for the dataset
summary(dataset)
##      Film              Genre           CriticalRating AudienceRating 
##  Length:562         Length:562         Min.   : 0.0   Min.   : 0.00  
##  Class :character   Class :character   1st Qu.:25.0   1st Qu.:47.00  
##  Mode  :character   Mode  :character   Median :46.0   Median :58.00  
##                                        Mean   :47.4   Mean   :58.83  
##                                        3rd Qu.:70.0   3rd Qu.:72.00  
##                                        Max.   :97.0   Max.   :96.00  
##  BudgetMillions       Year     
##  Min.   :  0.0   Min.   :2007  
##  1st Qu.: 20.0   1st Qu.:2008  
##  Median : 35.0   Median :2009  
##  Mean   : 50.1   Mean   :2009  
##  3rd Qu.: 65.0   3rd Qu.:2010  
##  Max.   :300.0   Max.   :2011
#Getting all column names
colnames(dataset)
## [1] "Film"           "Genre"          "CriticalRating" "AudienceRating"
## [5] "BudgetMillions" "Year"
#Calculating average of ratings for each genre
avgGenre<- ddply(dataset, "Genre", summarise, Audience.mean=mean(AudienceRating), Critical.mean=mean(CriticalRating))
avgGenre
##       Genre Audience.mean Critical.mean
## 1    Action      58.72078      44.40260
## 2 Adventure      62.72414      53.10345
## 3    Comedy      56.40698      44.91860
## 4     Drama      64.42574      56.47525
## 5    Horror      47.38776      34.57143
## 6   Romance      62.33333      48.19048
## 7  Thriller      65.58333      59.08333

Using plotly, the visualisation represents the whole dataset. The critical rating can be compared to audience rating; sizes of the point represents budget size, and each genre is given a specific colour.

#Plotting critical rating against audience rating, size represent budget and colour genre
movies <- ggplot(dataset, aes(x=CriticalRating, y=AudienceRating, colour=Genre, size=BudgetMillions, label=Film)) +
    geom_point(alpha=0.5) 

# Bigger Bubble represent movies with a bigger budget
IMovies <- ggplotly(movies)
IMovies

For further analysis and to make it more engaging, the animated visualisation below analyses how ratings change across the years 2007 to 2011 compared to its average for each genre. The library gganimate was used together with plotly to achieve the results.

From the animated visualisation, it is easy to compare the number of movies released across the years. It is also apparent that action movies have the most releases and high budget films. Romance genre has the least number of releases.

With the line intercepts present in each graph, it is also easy to analyse how each film’s rating change compared to the average. As seen, Thriller and Drama genres have the highest audience and critical ratings.

Reproduced from ThomasP85(version 1.0.8.9), R Graph Gallery. https://github.com/thomasp85/gganimate

movies <- ggplot(dataset, aes(CriticalRating, AudienceRating, size=BudgetMillions, colour = Genre, label=Film)) +
   geom_point(alpha = 0.7, show.legend = FALSE) +
#Adding average intercepts
   geom_vline(data=avgGenre, aes(xintercept=Critical.mean, color=Genre),
              size=0.4) +
  geom_hline(data=avgGenre, aes(yintercept=Audience.mean, color=Genre),
             size=0.4) +
#Adding annotations
  annotate("text", x=50, y=15, label= "Average Critical Rating", size=3, 
           angle = 90, color='grey') +
  annotate("text", x=1, y=58, label= "Average Audience Rating", size=3, 
           angle = 0, color='grey') +
#Size and scale range
  scale_size(range = c(1, 13)) +
  scale_x_log10(oob = scales::squish_infinite) +
  facet_wrap(~Genre, nrow = 2) +
#Adding labels and titles 
  labs(title = 'Year: {frame_time}', x = 'Critical Rating', 
       y = 'Audience Rating', 
       subtitle = 'Ratings of different genre movies from 2007 to 2011') 


#Animating using gganimate 
Anim_movies <- movies +  transition_time(Year) +
  ease_aes('linear')  +
  theme_few()

#Animate
animate(Anim_movies, width=1200, height=1000)
## Warning: Transformation introduced infinite values in continuous x-axis